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Because of the advance in technologies, modern statistical stud- 
ies often encounter linear models with the number of explanatory 
variables much larger than the sample size. Estimation and variable 
selection in these high-dimensional problems with deterministic de- 
sign points is very different from those in the case of random co- 
variates, due to the identifiability of the high-dimensional regression 
parameter vector. We show that a reasonable approach is to focus 
on the projection of the regression parameter vector onto the linear 
space generated by the design matrix. In this work, we consider the 
ridge regression estimator of the projection vector and propose to 
threshold the ridge regression estimator when the projection vector 
is sparse in the sense that many of its components are small. The 
proposed estimator has an explicit form and is easy to use in ap- 
plication. Asymptotic properties such as the consistency of variable 
selection and estimation and the convergence rate of the prediction 
mean squared error are established under some sparsity conditions 
on the projection vector. A simulation study is also conducted to 
examine the performance of the proposed estimator. 

1. Introduction. Consider the following linear model: 
(1) y i = n! i (3 + e i , i = l,...,n, 

where yi is an observed response variable, x,; is a p-dimensional vector of 
observed covariates or design points associated with yi, f3 is a p-dimensional 
vector of unknown parameters and e^'s are independent and identically dis- 
tributed unobserved random errors with mean and unknown variance a 2 . 
The theory of linear models is well established for traditional applications 
where the dimension p is fixed and the sample size n > p. With modern 
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technologies, however, in many biological, medical, social and economical 
studies, p is comparable with, or much larger than, n, and making valid 
statistical inference is a great challenge. 

In the case of p < n, there is a rich literature on variable selection, that is, 
identifying nonzero components of (3 in (1). For variable selection in the case 
of p > n and statistical inference afterwards, the development of statistical 
theory started about a decade ago. Some excellent advances in asymptotic 
theory have been made recently in situations where p diverges to infinity 
as the sample size n increases to infinity with the divergence rate 0{n l ) for 
some / > (polynomial-type divergence rate) or 0(e n ) for some v S (0,1) 
(ultra- high dimension). See, for example, Fan and Peng (2004), Hunter and 
Li (2005), Meinshausen and Buhlmann (2006), Zhao and Yu (2006), Zou 
(2006), Wang, Li and Tsai (2007), Fan and Lv (2008), Zhang and Huang 
(2008), Meinshausen and Yu (2009), Wang (2009) and a review by Fan and 
Lv (2010). When Xj's are random covariates, under some conditions, some 
variable selection methods have been shown to be selection-consistent in the 
sense that, with probability tending to 1 as n — > oo, the selected variables are 
exactly those related to the response, where the probability is with respect 
to the joint distribution of (?/j,Xj)'s. As Fan and Lv (2008) commented in 
the end of their stimulating paper, however, no selection-consistency result 
is available for deterministic x^'s and many applications, such as biomedical 
imaging and signal processing, involve deterministic design points. Another 
example in which Xj can be treated as deterministic is an analysis conditional 
on the observed covariates. 

Let X be the matrix whose ith row is x^, i = 1, . . . ,n. For simplicity, we 
call X the design matrix although Xj's are not necessarily designed points. 
When p > n, a key difference between a random X and a deterministic design 
matrix is the identifiability of the regression parameter f3 in (1), caused by 
the fact that the probabilities under consideration are different. For random 
Xj's that are independent and identically distributed and independent of £j's, 
(3 = [cov(xj)] -1 cov(xj, yi). Hence, even when p> n, components of (3 can be 
estimated, and nonzero components of (3 can be identified consistently with 
respect to the joint probability distribution of (yj,Xj)'s, under some condi- 
tions on cov(xj) and cov(xj,?/j). On the other hand, when the design matrix 
is deterministic or an analysis conditional on X is considered, the underlying 
probability is the probability distribution of (y±, . . . ,y n ) conditional on X, 
and (3 is identifiable if and only if it lies in a set having a one-to-one cor- 
respondence with 7£(X), the linear space spanned by rows of X. Since the 
dimension of 7£(X) is at most n, when p> n, (3 is generally not identifiable 
with respect to the probability distribution of (y±, . . . ,y n ) conditional on X. 
Consequently, with deterministic X and p> n, it is not realistic to derive 
consistent estimators of (3 or consistent variable selection procedures. 

Without selection-consistency [as previously described; see definition (7) 
in Section 4.1], we may still derive consistent estimators of some useful func- 
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tions of (3 under the p-dimensional linear model given by (1) with determin- 
istic X and p> n. This is the main focus of the current paper. Although (3 
is generally not identifiable when p> n, we argue in Section 2 that we may 
not need to estimate the entire vector (3. For statistical analysis, 9, the pro- 
jection of (3 onto 7£(X), is what we are able to estimate, and perhaps the 
estimation of 9 is sufficient for valid statistical inference. 

To estimate 9, we first consider the ridge regression estimator in Section 3. 
For any linear combination of the ridge regression estimator, we establish 
the asymptotic convergence rate of its mean squared error. We also obtain 
the convergence rate of the expected L2-norm error for the ridge regression 
estimator of X#. This expected L2-norm error divided by n is equal to the 
average prediction mean squared error minus a 2 . 

When 9 is sparse in the sense that many of its components are small, 
we consider in Section 4 a sparse estimator of 9 obtained by thresholding 
the ridge regression estimator of 9. We show that, with probability tend- 
ing to 1 at a fast rate, we can eliminate small components of 9 and keep 
large components of 9, that is, thresholding the ridge regression estimator 
provides a variable selection procedure, that is, consistent in some sense. 
This method is computationally much simpler than methods such as the 
LASSO [Tibshirani (1996)], SCAD [Fan and Li (2001)] and the ENET [Zou 
and Hastie (2005)], since no numerical minimization is required as the pro- 
posed estimator has an explicit form. We show that the convergence rate of 
the expected L2-norm error or average prediction mean squared error of the 
thresholded ridge regression estimator is much faster than that of the ridge 
regression estimator when 9 is sparse. In particular, the thresholded ridge 
regression estimator is estimation-consistent (defined in Section 4), but the 
ridge regression estimator may not be. 

Thresholding the ridge regression estimator is closely related to the SIS as 
shown in Fan and Lv (2008). However, its asymptotic behavior for determin- 
istic X is different from that for random X, and its consistency also requires 
very different conditions. For deterministic X and p> n, there does not exist 
any result on the consistency of the LASSO, SCAD or ENET. When p <n, 
Zhang and Huang (2008) showed that the LASSO is estimation-consistent, 
but the required conditions are more stringent and complicated than those 
required for the consistency of the thresholded ridge regression estimator. 

Some simulation results are presented in Section 5 to study the estimation 
and prediction performance of the proposed method, the ridge regression, 
the LASSO and the ENET. All technical proofs are given in Section 6. 

2. Identifiability and projection. We consider model (1) with determin- 
istic design matrix X = (xi, . . . , x„)', where the dimension of Xj, p, is larger 
than n. Let r = r n be the rank of X. From the singular value decomposition, 

(2) X = PDQ', 
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where P is an n x r matrix satisfying P'P = I r , Q is a p x r matrix satisfying 
Q'Q = I r , I a denotes the identity matrix of order a and D is an r x r diagonal 
matrix of full rank. Let be a p x (p — r) matrix such that Q'Q± = (the 
matrix of O's with an appropriate order) and Q'j_Q± = l p - r - Throughout, we 
denote the q-dimensional Euclidean space by lZ q for any positive integer q 
and the subspace of W generated by the rows of X by 7£(X). 

We say that (3 in (1) is identifiable if (3 l G B, (3 2 G B and X/3 X = X/3 2 
imply /3 1 = (3 2 , where B is the parameter space of /3. The following lemma 
gives a sufficient and necessary condition for the identifiability of (3. 

Lemma 1. Under model (1) with p> r, (3 is identifiable if and only if 
there exists a known function (ft from W to 1Z P ~ V such that 

(3) B = {(3 : (3 = Q£ + Q ± 0(O, * G K r }. 

Lemma 1 reveals that identifiable /3's must be in a set having a one-to-one 
correspondence with 7£(X) = {(3:(3 = Q£,£ € 7£ r }. Since the dimension of 
the set on the right-hand side of (3) is r < n f\p (the minimum of n and p), 
f3 is typically not identifiable when p > n and, hence, we are not able to 
obtain a component-wise consistent estimator of (3. However, we may not 
need to estimate the entire vector f3, that is, if X/3 1 = X/3 2 , we can still 
estimate parameters related to X/3 X = X/3 2 and make valid inference without 
trying to distinguish (3^ and (3 2 . Therefore, we consider the projection of (3 
onto 7£(X), which is what we are able to identify in view of Lemma 1. Define 

(XX')" =PD~ 2 P', 

which is (XX')" 1 if r = n. The projection of f3 onto 7£(X) is 

(4) = X'(XXVX/3 = QQ73. 

Note that 6 G 1Z(X) and 6 = (3 if and only if (3 G 7£(X). Furthermore, X0 = 
X/3 and model (1) can be written as 

(5) y i = x' i 6 + £ i , i = l,...,n. 

Thus, estimating 9 is enough for inference about parameters X/3 = X0 and 
prediction. 

The dimension of is still p. When (3 has many zero components, may 
not have any zero component. However, 6 may have many small components. 
This can be seen from the L 2 -norms of (3 and 0. Since = QQ'/3 and QQ' 
is a projection matrix, we obtain that \\0\\ < \\(3\\, where || • || denotes the 
L 2 -norm. This implies that if (3 has many zero components so that the 
order of \\(3\\ is much smaller than 0(p), then the order of ||0|| is also much 
smaller than 0{p). Hence, if components of 6 are nonzero, then many of them 
must be negligible, and 9 can be viewed as a sparse vector. More precise 
descriptions of this sparsity can be found in conditions (C2) in Section 3 
and (C4) in Section 4. 
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3. The ridge regression estimator of the projection. Since the dimension 
of in (4) is p > n, we consider the ridge regression estimator of 6 [Hoerl 
and Kennard (1970)] under model (5). 

0=(X / X + /i n I P )" 1 X'y, 

where y = (yi, • • ■ , y n )' and h n > is an appropriately chosen regularization 
parameter. The computation of 6 involves only inverting an n x n matrix. 
This is because (2) implies that 

(6) (X'X + Kip)- 1 *' = X'(XX' + h n l n )~\ 

which also implies that the ridge regression estimator 6 is always in TZ(X). In 
fact, if (3 is the ridge regression estimator of (3 constructed under model (1), 
then G = X'(XX')~X/3 = (3. But 6 = (3 estimates 0, not the nonidentifi- 
able (3 when p> n. 

We now study the bias and variance of 6 as an estimator of 6, which is 
essential for establishing asymptotic properties of 6. For the matrix Q given 
in the singular value decomposition (2), T = (QQ±) is orthogonal, that is, 
r'r = rT' = Ip. Then 

bias(0) = E{6) - 6 

= (x'x + Mp^x'xe - e 

= -(/ i - 1 X'X + I p )- 1 6' 

= -r(/ i ~ 1 r'x / xr + i p )^ 1 r'QQ / 6> 

= -(Q(/ l - i D 2 +i r r i q±)( q o ) 

= -Q(/i- 1 D 2 + I r )- 1 Q / 6>, 

where the fourth equality follows from the fact that T is orthogonal and 
6 = QQ'/3 = QQ'0. The covariance matrix of 9 is given by 

var(0) = a 2 (X'X + /i n I p )~ 1 X / X(X'X + h n l p y l 

^'(x'x + Mp)" 1 

where A < B for nonnegative definite matrices A and B means B — A is 
nonnegative definite. 

To study the asymptotic properties of 0, we consider n — > oo and p = p n , 
a function of n. Quantities such as (3, y, Xj, etc., form triangular arrays, but 



G 



J. SHAO AND X. DENG 



the subscript n is omitted for simplicity. We assume that Ai n , the smallest 
positive eigenvalue of X'X, satisfies 

(CI) A^ 1 = 0(n~ v ), rj < 1 and rj does not depend on n. 

We also need a sparsity condition on 0. From the discussion in the end 
of Section 2, we conclude that, in terms of the L2-norm, the sparsity of (3 
implies the sparsity of 0. We assume that 

(C2) \\G\\ = 0(n T ), t <rj and r does not depend on n. 

If the number of nonzero components of (3 is 0(n 2l ~), and all absolute 
values of nonzero components of (3 are bounded by a constant M, then (C2) 
holds since ||0|| < \\(3\\ < Mn T . 

Theorem 1. Assume model (1) and conditions (CI) and (C2). 

(i) As n -> oo, E(\'e - 1'0) 2 = 0{h~ l ) + 0(/i 2 ra~ 2(?? ~ r) ) uniformly over 
p- dimensional deterministic vector 1 with |1| = 1. 

(ii) n^EpLG - X0|| 2 = O^n" 1 ) + 0(/i 2 n-( 1+? ?- 2r )). 

Note that these results hold without any condition on the dimension p. 
Theorem l(i) shows that the mean squared error of 1'0 converges to uni- 
formly in 1 if h n — > oo and h n n~^ r] ~ T ^ 1 — > 0. Theorem 1(h) gives the con- 
vergence rate of the expected L2-norm error i^Xf? — X0|| 2 for estimating 
E(y) = X0. Since the dimension of X0 is n, we say that an estimator # of 6 
is Inconsistent if n _1 i?||X , i? — X0|| 2 — > as n — > oo. Typically, r n /n does 
not converge to and, hence, X0 may not be Z^-consistent. 

To elaborate the motivation of using the expected L2" n orm error i?||Xi? — 
X0|| 2 as a performance measure for an estimator i? of 6, we consider the 
problem of predicting future y-values on deterministic X. Let y* be inde- 
pendent of y but with the same distribution as y. For deterministic X, it 
is typical to assess the accuracy of the prediction Xi9 using the average 
prediction mean squared error n -1 -E||y* — Xi9|| 2 . It turns out that 

n -1 £J||y* - X^|| 2 = a 2 + n -1 £||x£ - X0|| 2 . 

Hence, having a small expected L2- n orm error is equivalent to having a small 
average prediction mean squared error. 

4. The thresholded ridge regression estimator. The discussion in the 
previous section indicates that, although the ridge regression estimator 9 is 
consistent for the estimation of any linear combination of 6, it may not be 
inconsistent, that is, n~~ 1 E\\X.O — X0|| 2 may not converge to 0. To achieve 
inconsistency (and good prediction property) under some sparsity condi- 
tions on 0, we propose to improve the ridge regression estimator by thresh- 
olding. 
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4.1. Variable selection. Let Mp^ be the set of indices of nonzero com- 
ponents of (3, and let M be the set of indices of components of (3 selected 
using a variable selection method. The variable selection method or Ai is 
said to be selection-consistent if and only if 

(7) lim P(M=M3q) = 1. 

n— >oo 

Unlike the case of random X, for deterministic X with p> n, the selection- 
consistency defined by (7) is generally not achievable because (3 is not identi- 
fiable. Some selection-consistency results for the case of p > n and determin- 
istic X published in the literature are based on very strong and sometimes 
unrealistic conditions on the design matrix X to ensure the identifiability 
of (3. In fact, when (3 is not identifiable, it is not appropriate to use (3 to de- 
scribe usefulness of components of Xj, since two different (3 may result in the 
same responses under model (1). Although components of Xj corresponding 
to zero components of (3 are not related to yi, due to the fact that (3 is 
unknown and not identifiable, these components of Xj may still be useful in 
statistical analysis since we have to use model (5) instead of model (1), that 
is, 9 instead of (3. 

The previous discussion leads to variable selection in terms of the pro- 
jection vector 9, since any linear combination V (3 is estimable if and only 
if V (3 = Y9. However, when (3 contains many zero components, 9 may not 
have any zero component, although many components of 9 may be close to 
zero. Small but not exactly zero components of 9 do not contribute much in 
estimation but add variability. Thus, we would like to carry out variable se- 
lection in a more general sense as defined by Zhang and Huang (2008), that 
is, we try to eliminate small components of 9. Condition (C4) stated later 
may be used to define whether a component of 9 can be treated as small. 

We propose to threshold the ridge regression estimator 9. Let 6j be the jth 
components of 9, j = 1, . . . ,p. The thresholded ridge regression estimator is 
defined as 9 whose jth. component 6j = 9j if \6j \ > a n and 8j = if \0j\ < a n , 
j = 1, . . . ,p, where 

(8) a n = dn- a , 0<a<l/2,Ci > 0, 

is the thresholding value with a and C\ not depending on n. The computa- 
tion of 9 is easy since it has an explicit form. Thresholding can be viewed 
as a variable selection procedure; that is, we select components of 9 with 
indices in A4g , the set of indices of nonzero components of 9. We now 
study the asymptotic behavior of M. Q a under some conditions and appro- 
priate choices of a n and h n . A condition on the divergence rate of p = p n as 
n — > oo is 

(C3) p = 0(e n ), < v < 1 and v does not depend on n. 

If p = e n " , it is referred to as the ultra-high dimension [Fan and Lv (2010)]. 
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Theorem 2. Assume model (1) with normally distributed £j and con- 
ditions (C1)-(C3). Let a n be given by (8) with a < (n — v — r)/3, u n = 
l + (loglog?i)~ 1 and h n = C2a~ 2 (loglogn) 3 log(nV p) , where C2 > is a con- 
stant and 11V p is the maximum of n and p. Then, for any constant t>0, 

(9) P{Me, anUn C M b an c M e , an/Un ) = 1 - 0((n V p)^), 

where M.£ lCn denotes the set of indices of components of £ whose absolute 
values are larger than c n . 

Result (9) shows that, by thresholding 9, we can eliminate all components 
of 9 with absolute values less than a n /u n , but keep all components of 9 with 
absolute values larger than a n u n , with probability tending to 1 at the rate 
of 0((n Vp)~') for any t > 0. This rate is at least 0(n~') for any t > and 
it is 0(e~ tn ) for any t > if logp has exactly the order n v . 

Let q n - and q n+ be the numbers of elements in Me,a n u n and .Me,a n /u n j 
respectively. Then q n _ < q n+ . Since u n — > 1 , it is often true that q n+ — q n _ — > 
as n — > 00. Then, result (9) implies that 

(10) P(M~ 0tan =M 9 ,aJ = 1 - 0((n Vp)"*), 

which will be referred to as the consistency of A4q a . This consistency is 
weaker than the selection-consistency given by (7), but the latter may not 
be achieved. 

We now consider nonnormal under model (1), that is, the normality 
assumption on £j is replaced by 

(M) E(ef) < 00 for an even integer k not depending on n, 
and condition (C3) is replaced by 

(C3') p = 0(n), 1 < I < k/6 and I does not depend on n, 

while the other conditions, (CI) and (C2), remain the same. When the nor- 
mality condition is relaxed to the moment condition (M), we cannot handle 
a dimension at the divergence rate given by (C3), although the polynomial- 
type divergence rate given by (C3') can still be much larger than n. The inte- 
ger k in condition (M) has to be sufficiently large so that 3l(t + l)/k < rj — r, 
where t > is in the convergence rate of Ma 

tf,a n 

Theorem 2 A. Assume model (1) and conditions (M), (CI), (C2) 
and ( C3 ). For any t>0, let a n be given by (8) with a < (77— £— r)/3 and £ = 
3Z(t + l)/k, and u n = 1 + (log log n)" 1 . If h n = C 2 a~ 2 (log log n) 2 (nVp) 2 ^^ , 
where C2 > is a constant, then result (9) holds. 

4.2. L2- consistency. The following result shows that, after the variable 
selection, the thresholded estimator 9 has asymptotically smaller expected 
L2-norm error than 9, and it is in fact Inconsistent, under the following 
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sparsity condition on 0: 
(C4) 

Qn+ Qn— ->-0, q n /r n ^0 and a n v n -> 0, 

where 

|0j|, 

j: 0j|<a n 

#j is the jth component of 6, r n is the rank of X, a n is given by (8), 
and q n , q n - and q n+ are, respectively, the numbers of elements in sets Me,a n , 

M and Me,an/u n g iven b y ( 9 )- 

The last two conditions in (C4) are very similar to condition (2.4) in 

Zhang and Huang (2008); that is, there exist q n "large" components of 

with q n much smaller than the rank of X, and v n , the L\ norm of the "small" 

components of 6, may diverges to oo, but at a rate slower than a" 1 . 

Theorem 3. Assume the conditions in Theorem 2 or 2A. Assume fur- 
ther that (C4) holds and the maximum eigenvalue o/X'X is 0(n). Then 

(11) rT^EpLO - X0|| 2 = 0{q n n~ l ) + 0(v n a n ) + 0{h 2 n n^ l+ ^ 2 ^). 

Result (11) shows the gain of variable selection by thresholding. The ex- 
pected L2-norm error n - X E\\X£ - X0|| 2 is smaller than n^EpLO - X0|| 2 
for sufficiently large n. The former converges to at a certain rate and 
hence 9 is inconsistent, whereas the latter may not converge to when 
r n /n does not converge to 0. 

If q n /n — > 0, result (11) can also be established with the vector of nonzero 
components of replaced by the ordinary least squares estimator of the 
sub- vector of 6 indexed by the set M.~ a 

4.3. Tuning parameters. To apply thresholding, we need to choose the 
constants C\ in the thresholding value a n given by (8) and in the regu- 
larization parameter h n given in Theorem 2 or 2A. Similar to many other 
problems, C\ and C2 can be viewed as tuning parameters, and there is no 
optimal way to find their values. Some discussions can be found, for exam- 
ple, in Fan and Lv (2008). It is possible to use a data-driven method to 
find values of tuning parameters by minimizing the average prediction mean 
squared error n~ 1 E\\y* — X0|| 2 = a 2 + n~ l E\\XB — X#|| 2 . 

Let ip(C) be the average prediction mean squared error when C = (C\, C2) 
is used in a n and h n . Since ip{C) is unknown, we minimize the cross- 
validation estimator 

1 n 



10 



J. SHAO AND X. DENG 



where is the thresholded ridge regression estimator of based on the 

data set with (yj,Xj) removed, i = 1, . . . , n. To avoid repeated computation 

~ (c) 

of , we may use an equivalent formula for ip(C), 

~ (c) 

where u>j(C) = x^(X / X + /i n / p ) _1 Xj and is the thresholded ridge regres- 
sion estimator based on the whole data set. This method is applied in the 
simulation study presented in the next section. 

5. Simulation results. With deterministic X and p > n, we examined 
the L2-norm errors and the expected L2-norm errors of the ridge regres- 
sion estimator, the thresholded ridge regression estimator, and the popular 
LASSO estimator and ENET estimator (for comparison purpose) in four 
simulation studies. In the first two simulation studies, the design matrix X 
was generated from a multivariate normal distribution but fixed throughout 
the simulation, which corresponds to analysis conditional on X. In the last 
two simulation studies, X is a nearly orthogonal Latin hypercube design or 
a Latin hypercube design. 

5.1. Simulation study I. We considered linear model (1) with normally 
distributed £j and a = 10. Three sets of sample and variable sizes were con- 
sidered, (n,p) = (30,100), (100,500) and (200,2000), with increasing ratio 
p/n. A set of xi, . . . ,x„ were independently generated with Xj ~ N(0, S), 
where the diagonal elements of S are all equal to 1 and off-diagonal elements 
of X! are all equal to 0.75. This set of X was fixed throughout the simula- 
tion. The first 20 components of f3 are 1 + O.lj for j = 1, . . . , 20, and the rest 
of the components of f3 are all equal to 0. The L2 cumulative proportion 
plot of the projection vector 6, that is, X^j=i ^)/II^II 2 j k = 1, . . . ,p, is given 
in Figure 1, where 9?^ is the jth ordered value of 9f,...,9p. Although /3 
has many zero components, 6 does not have any zero component but many 
components of 6 are small. 

For the thresholded ridge regression estimator, we selected the tuning 
parameter C = {Ci^C^j by minimizing tp(C) given by (12). For the ridge 
regression, LASSO, and ENET estimators, the tuning parameters were se- 
lected by a 5-fold cross-validation. 

Let i9 denote the thresholded ridge regression estimator 6, the ridge re- 
gression estimator 0, the LASSO estimator or the ENET estimator. We in- 
dependently generated 100 values of y and obtained 100 values of n~ l ||X/3 — 
Xi9|| 2 , the L2-norm error (divided by the sample size). Box plots of 100 val- 
ues of ?i _1 ||X/3 — Xi9|| 2 for four estimation methods are given in Figure 1. 
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20 40 60 80 100 Thres. Ridge Lasso Enet Ridge 

Index 



n=100, p=500 n=100, p=500 




T 1 1 1 1 1 1 1 1 — 

100 200 300 400 500 Thres. Ridge Lasso Enet Ridge 

Index 



n=200, p=2000 n=200, p=2000 




"i 1 1 1 1 1 1 1 — 

500 1000 1500 2000 Thres. Ridge Lasso Enet Ridge 

Index 



Fig. 1. Study I: L2 cumulative proportion plot of and box plots of L2 -norm error for 
the thresholded ridge regression, LASSO, ENET and ridge regression. 

The average of 100 values of n -1 ||X/3 — X$|| 2 , a simulation approximation 
to the expected L2-norm error n -1 .E||X/3 — X#|| 2 , is listed in Table 1 for 
each of the four methods. 

5.2. Simulation study II. The setting in this study is the same as that 
in simulation study I except that the values of Xj 's were generated with a X! 
whose (fe,/)th element is equal to (0.5)' fc_ '' when \k — l\ < 10 and when 
\k — l\ > 10. The L2 cumulative proportion plot of and box plots of values of 
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Table 1 

Simulation approximation to the expected Li-norm error 



Method 



Study 


n 


P 


Thres. Ridge 


LASSO 


ENET 


Ridge 


I 


30 


100 


27.34 


48.46 


44.56 


51.48 




100 


500 


24.72 


32.01 


28.46 


44.32 




200 


2000 


21.86 


25.37 


24.17 


49.37 


II 


30 


100 


56.50 


69.05 


70.70 


76.05 




100 


500 


59.35 


68.33 


64.43 


94.06 




200 


2000 


74.59 


85.14 


82.35 


100.75 


III 


49 


96 


61.58 


78.40 


76.83 


85.46 




64 


192 


54.79 


81.54 


79.78 


78.34 


IV 


30 


100 


43.44 


55.35 


49.29 


59.72 




100 


500 


46.49 


56.60 


52.83 


65.85 




200 


2000 


48.53 


51.78 


56.26 


71.21 



n 1 ||X/3 — X#|| 2 based on 100 simulation runs for four estimation methods 
are given in Figure 2. The simulation approximations to n^ 1 E\\~K{3 — X$|| 2 
are included in Table 1. 

5.3. Simulation study III. Let NOLH(n,p) denote a nearly orthogonal 
Latin hypercube design with n rows (runs) and p columns (variables). We 
considered two sets of n and p. In the first case, n = 49, p = 96 and X is an 
NOLH(49,96) constructed by using the orthogonal array-based method in 
Lin, Mukerjee and Tang (2009). In the second case, n = 64, p = 192 and X 
is an NOLH(64, 192). In both cases, the first 15 components of f3 are equal 
to 0.2, 0.4, 2.8, 3.0, and the rest components of (3 are equal to 0. The 
standard deviation of £j is 8. The rest of the simulation setting is the same 
as that in simulation study I. The L2 cumulative proportion plot of 6 and 
box plots of values of n, _1 ||X/3 — X#|| 2 based on 100 simulation runs for four 
estimation methods are given in Figure 3. The simulation approximations 
to n _1 £||X/3 - Xtf|| 2 are included in Table 1. 

5.4. Simulation study IV. The setting in this study is the same as that 
in simulation study I except that X is a deterministic Latin hypercube de- 
sign [McKay, Beckman and Conover (1979)]: each column of X is a random 
permutation of n points 6(i/n) — 3, i = 1, . .. , n, and all columns are gener- 
ated independently. The L2 cumulative proportion plot of 6 and box plots 
of values of n _1 ||X/3 — X$|| 2 based on 100 simulation runs for four esti- 
mation methods are given in Figure 4. The simulation approximations to 
n _1 .E||X/3 — Xi9|| 2 are included in Table 1. 

5.5. Conclusions based on simulation studies. From Table 1 and Fig- 
ures 1-4, we conclude that the thresholded ridge regression estimator is 
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n=30, p=100 n=30, p=100 




"i 1 1 1 1 1 1 1 1 — 

100 200 300 400 500 Thres. Ridge Lasso Enet Ridge 

Index 



n=200, p=2000 n=200, p=2000 




Index 

Fig. 2. Study II: L2 cumulative proportion plot of and box plots of L^-norm error for 
the thresholded ridge regression, LASSO, ENET and ridge regression. 

much better than the ridge regression estimator in terms of the L2-norm 
error or the expected L2-norm error, which supports our asymptotic theory, 
that is, the thresholded ridge regression estimator is ^-consistent whereas 
the ridge regression estimator is not. Because the expected L2-norm error 
is linearly related to the average prediction mean squared error (Section 3), 
these results show that thresholding ridge regression has better prediction 
performance. Except for study III, the LASSO performs worse than the 
ENET and thresholded ridge regression, but better than the ridge regres- 
sion, and the ENET performs worse than the thresholded ridge regression, 



14 



J. SHAO AND X. DENG 



n=49, p=96 n=49. p=96 




50 100 150 Thres. Ridge Lasso Enet 



Index 

Fig. 3. Study III: L2 cumulative proportion plot of 9 and box plots of L2 -norm error for 
the thresholded ridge regression, LASSO, ENET and ridge regression. 

although the difference is small in some cases. Since the ENET uses a com- 
bination of L\- and L2-penalty, it is not surprising that its performance is 
between the LASSO and thresholded ridge regression. However, both LASSO 
and ENET have large variability in simulation study III. It is well known 
that the LASSO requires more stringent conditions on the design matrix X 
[e.g., Zhao and Yu (2006)]. The nearly orthogonal Latin hypercube design 
in simulation study III may not satisfy these conditions, which results in 
the poor performance of the LASSO. This also applies to the ENET, since 
it uses Li-penalty. Furthermore, no result for the Inconsistency of LASSO 
or ENET is available in the situation of deterministic X and p> n. 

In terms of the computation, the thresholded ridge regression is much 
simpler than the LASSO or ENET, especially when p is very large. Because 
of the identity (6), the computation complexity of the thersholded ridge 
regression estimator does not increase as p increases. 

6. Proofs. 

Proof of Lemma 1. Suppose that (3) holds. Let /3j G B, j = 1, 2. Then 
there are ^ E TZ r such that pj = + Q_l</>(^), j = 1, 2. If X^ = X/3 2 , 



ESTIMATION IN HIGH-DIMENSIONAL LINEAR MODELS 15 
n=30, p=100 n=30, p=100 




500 1000 1500 2000 Thres.Ridge Lasso Enet Ridge 

Index 



n=100, p=500 n=100, p=500 




~i 1 1 1 1 1 1 1 1 — 

100 200 300 400 500 Thres.Ridge Lasso Enet Ridge 

Index 



n=200, p=2000 n=200, p=2000 




500 1000 1500 2000 Thres.Ridge Lasso Enet Ridge 

Index 

Fig. 4. Study IV: L2 cumulative proportion plot of and box plots of L2 -norm error for 
the thresholded ridge regression, LASSO, ENET and ridge regression. 



then, by (2), PD^ = PD£ 2 and, thus, £1 = £ 2 , which implies Pi = (3 2 - This 
shows that the parameter (3 in (1) is identifiable. 

Suppose now that B is not of the form (3). Then, there exist £ € 1Z T , 
Cj e j = 1, 2, Ci C 2 and f3 J = Q£ + Q ± £ 3 e B. Then (3 1 ? (3 2 , but 

X/3j = PD£ = X/3 2 - This shows that /3 in (1) is not identifiable. □ 

Proof of Theorem 1. 

(i) From Section 3, bias(0) = -Q(/i n ; 1 D 2 + I^^Q'fl. From the facts 
that Q'Q = I r , D 2 contains positive eigenvalues of X'X, and (/i~ 1 D 2 + 
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Ir)" 1 < j^j^Ir, we obtain that ||bias(0)|| < \\0\\(h n /\ ln ). Hence, by (CI) 

and (C2), [l'bias(0)] 2 < ||bias(0)|| 2 = 0(/#n -2 fa- T >) uniformly over 1 with 
||1|| = 1. Also, from Section 3, var(0) < a 2 h-%. Hence, l'var(0)l = 0{h~ l ) 
uniformly over 1 with |1| = 1. Then, the result follows from E(1'0 — 1'0) 2 = 
l'var(0)l + [l'bias(0)] 2 . 

(ii) Note that E\\X0 - X0|| 2 = trace[X var(0)X'] + ||Xbias(0)|| 2 . From 
the proof of (i), 

Xvar(0)X' < o- 2 X(X'X + Kip)' 1 *! 

= cj 2 PD(D 2 + /i n I r ) _1 DP / 
< a 2 PP', 

since D(D 2 + /i n I r )~ 1 D is a diagonal matrix whose diagonal elements are 
bounded by 1. Hence, trace[X var(0)X'] < a 2 trace(PP') = a 2 r n . Also, 

||Xbias(0)|| 2 = O'Qih- 1 ^ 2 + I r )~ 1 D 2 (/ l - 1 D 2 + I^Q'G < h 2 n X^\\6\\ 2 , 

which is 0(/i 2 n~^~ 2r ^) by (CI) and (C2). This completes the proof. □ 

Proof of Theorem 2. From the proof of Theorem 1, 

bias(%) = 0(\\e\\h n /X ln ) = 0{h n /n^ T ) 

uniformly in j = 1, . . . ,p. For sufficiently large re, log log n > 0. With h n = 
C2a~ 2 (loglogn) 3 log(nVp) and condition (C3), 

h n _ C 2 (loglogn) 4 log(n V p) c\ (log log re) 4 



n r >- T (u n - l)a n n^~ T a^ ~ n v-^-r-3a 

for some constant c\ > and, hence, | bias(#j)|/[(ii ra — l)a n ] —> uniformly 
in j when a < (re — v — r)/3. Since var(#j) = 0(h~ l ), there is a constant 
Co > such that 

|bias(0j)| - (u n - l)a n /- r— 



[var(^)]V2 

Let <3? be the standard normal distribution function. From (1) with normally 
distributed £,-, 



P(\0j-0j\ >(«n-lK)< 2$ 



bias(gj)| - (u n - l)q n 
[var(^-)] 1/2 



< 2$(-^/2co^A^an/(loglogre)) 

< exp{-co/i„a 2 /(loglogre) 2 }, 
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for sufficiently large n, where the last inequality follows from 2<E>(— x) < 
e~ x I 2 for x > 2 and the fact that h n a^/ (log log n) 2 = C2 log log n log(n Vp) — >■ 
00. Using the same argument, we also obtain that 

P(\6j - 6j\ > (1 - u" 1 )^) < exp{-co/i n an/( lo g lo g n ) 2 } 
for sufficiently large n. Let t > be given. For sufficiently large n, c^Ci log log n — 
1 > t and, hence, 



P(M , anUn c M^J >1-p( |J {&] < 

j: \8j\>u n a n 

>1-P( U {I^-%I>K-1K}) 



j : \0j\>u„a n 
P 

>l-Y, P (\0j-Qj\>(un-l)a n ) 
i=i 

> 1 -pexp{-Co/i n a 2 /(loglogn) 2 } 

> 1 - (nV p)~*. 

Similarly, for any t > 0, 



p(M^ an c >t e>ari/u j > p( n a-i ^ °«}) 

J : |%|<1n/«n 

>l-i>( (J {l^-fljXl-Oon}) 



j : |0j|<a n /u,j 

> 1 -pexp{-co/i„a^/(loglogn) 2 } 

> 1 - (jiVp)"' 

for sufficiently large n. This completes the proof. □ 

PROOF of Theorem 2A. From the proof of Theorem 1, we still have 
bias(#j) = 0(h n /n v ~ T ) uniformly in j = 1, . . . ,p. Let Q be the jth compo- 
nent of (X'X + /i n I P )" 1 Er=i x i(yi- x ^)- Th en, for u n = 1 + (log log n)" 1 , 

P(|^-,,|> K -iK)<|^^ 

_/ |bias(%)| fc + £((*; 
V [(u n - l)a n ] fc 

/ ^(loglogn) fc \ / (log log n)* 



n k( V -r) a k J y h k/2 a K 
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where the last equality follows from E{^) = 0(hn k/2 ) [Whittle (1960), The- 
orem 2]. Similarly, 



P( % - e,i > (i - „->„> ^o(M^) + o( <™ ) . 

Using /i n = C2a~ 2 (loglogra) 2 (ra Vp) 2 ^' 3 ^, we obtain that 
P(M d)0n c ^W«J > 1 - - fil > (1 - 

3=1 

/ V4(loglogn) fc \ _ ^p(loglogn) fe 



1-0 
-O 
1-0 
-O 



fl n / V /in a* 

p(n V p) 2fe ?/ ( 3 ') (log log n) 3h " 



n k(ri-r) a 3k 

P 



(nVp)W( 3i ' 

p(n V p) k ^' l (log logn) 3fc 
(n V pyt+^n^-Za-T) 

P 



n fcg (loglog?i) 3fc 



(n V pyn^-Sa-r) J \(nV p) 1 
= 1 - o((n Vp)~ 4 ) - 0((n Vp)~') 
= l-0((nVp)-*), 
since k£/(3l) = t + l and a < (7/ — £ — t)/3. Similarly, 

PC^e,^ C .M^J > 1 - 0((n Vp)"*). 
Hence, result (9) follows. □ 

Proof of Theorem 3. Let A n = {Mg a = Mo An } and A c n be its 
complement. On the set A n , the number of nonzero components of 6 is the 
same as q n . Let 0\ be 6 with its components smaller than a n in absolute value 
set to 0. Under condition (C4) and the condition that X'X has a maximum 
eigenvalue bounded by era for a constant c, 

tTHXBi -X0|| 2 <c||0i -0\\ 2 



E 

j: \6j\<a„ 
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<ca n ^ 

j: \6j\<a n 

= 0(v n a n ). 

Hence, 

n _1 £;||X0 - X0|| 2 < 2n~ l {E\\X.e - X0i || 2 + ||X0i - X0|| 2 ) 
= 2n- 1 E\\X6 - X0i || 2 + 0(v n a n ). 
Then, it remains to show that 

(13) n _1 £||X0 - X6>i|| 2 = 0{q n n~ l ) + 0(v n a n ) + 0(/i 2 n- (1+ ^ 2r) ). 
Following the proof of Theorem 1 we obtain that 

n^ElWXB - X0 1 || 2 / A J = 0(q n n- 1 ) + 0(/£„-(i-HH*)), 
where I a is the indicator of the set A. From 

llXe-Xflill 2 /^ < 211X0 -Xe\\ 2 I A c + 211X0 -X0i|| 2 /^c 

II J-ll s-L n — II II sT- n II 1 1 

and Theorem 1, result (13) follows if we can show that 

n^EWXO - Xe\\ 2 I A c n = oiqnn' 1 V /i 2 n"( 1+,? - 2T) ). 

Since 

||X0 - X0|| 2 = (0 - 0)'X'X(0 - 0) < 0{n)\\6 - 0|| 2 < 0{a 2 n pn), 

the result follows from P{A ( j l ) = 0((n Vp)~') for any t > according to 
Theorem 2 or 2A. This completes the proof. □ 
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